The aim of my investigation is to find out which variables are related to volatile acidity, which in turn affects the quality of white wine. The dataset consists of 12 variables and 4898 observations.

# Load all of the packages 

library(ggplot2)
library(reshape)
library(corrplot)
# Load the Data
wine <- read.csv('wineQualityWhites.csv')

Data Summary

The dataset description is shown below. I created a new variable, bound sulfur dioxide, defined as total sulfur dioxide minus free sulfur dioxide.

#Drop the row-index column X
wine$X <- NULL

#Create the bound.sulfur.dioxide variable
wine <- within(wine, bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide )
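
A quick sanity check of the derived column can be sketched on a few rows. The toy data frame below copies the first three rows of the free and total sulfur dioxide columns from the str() output in this section; the name `toy` is only for illustration and is not part of the analysis.

```r
# Toy data frame: first three rows of the two sulfur dioxide columns
# (values copied from the str() output of this report).
toy <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Same derivation as used for the full dataset
toy <- within(toy, bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide)

# Bound + free should reproduce total exactly
stopifnot(all(toy$bound.sulfur.dioxide + toy$free.sulfur.dioxide ==
              toy$total.sulfur.dioxide))

toy$bound.sulfur.dioxide
```

The three results should agree with the first three values of bound.sulfur.dioxide in the str() output.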

#Structure of Data
dim(wine)
## [1] 4898   13
str(wine)
## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ bound.sulfur.dioxide: num  125 118 67 139 139 67 106 125 118 101 ...
#summary of Data
summary(wine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality      bound.sulfur.dioxide
##  Min.   :3.000   Min.   :  4.0       
##  1st Qu.:5.000   1st Qu.: 78.0       
##  Median :6.000   Median :100.0       
##  Mean   :5.878   Mean   :103.1       
##  3rd Qu.:6.000   3rd Qu.:125.0       
##  Max.   :9.000   Max.   :331.0

Now, I will be performing Univariate, Bivariate and Multivariate analysis.

Univariate Analysis

#Function to generate ggplots of some features

univ_cont <- function(feat) {
    ggplot(data=wine, aes_string(x = feat)) + geom_histogram()
}

uni_va <- univ_cont("volatile.acidity")
uni_ph <- univ_cont("pH")
uni_den <- univ_cont("density")
uni_alc <- univ_cont("alcohol")
uni_sul <- univ_cont("sulphates")

Volatile Acidity

#Histogram chart of volatile.acidity
plot(uni_va)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Histogram chart of volatile.acidity - outliers removed
ggplot(aes(x = volatile.acidity), data = wine)+
  geom_histogram(binwidth = 0.01)+
  coord_trans(y = 'sqrt')+
  scale_x_continuous(limits = c(0.1,0.70), breaks = seq(0.1,0.70,0.1))
## Warning: Removed 24 rows containing non-finite values (stat_bin).

The distribution appears unimodal with the volatile acidity peaking around 0.28.
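
The "Removed 24 rows" warning comes from the x-axis limits: values outside the limits are dropped before binning. A minimal sketch of the same count in base R follows, using a small made-up vector `va` (an assumption for illustration, not the wine data).

```r
# Hypothetical volatile acidity values (illustration only, not the wine data)
va <- c(0.08, 0.09, 0.21, 0.26, 0.28, 0.32, 0.45, 0.71, 0.905, 1.1)

lower <- 0.1   # same limits as the histogram above
upper <- 0.7

# Values outside [lower, upper] are the ones the axis limits drop
outside <- va < lower | va > upper

sum(outside)   # number of dropped values
va[outside]    # which values they are
```

Applied to wine$volatile.acidity, the frequency table in this report gives 6 values below 0.1 and 18 above 0.7, which accounts for the 24 rows in the warning.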


Quality

Is there any effect on quality? First, what does the distribution look like across the quality levels?

#Bar chart of quality
ggplot(aes(x = quality), data = wine)+
  geom_bar()+
  scale_x_continuous(limits = c(0,10), breaks = seq(0,10,1))

The majority of white wines have a quality level of 5 or 6.
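
To put a number on "the majority", the share of each quality level can be computed with table() and prop.table(). The sketch below uses a short made-up vector `quality` (an assumption for illustration); with the real data the same calls would take wine$quality.

```r
# Hypothetical quality scores (illustration only, not the wine data)
quality <- c(5, 6, 6, 5, 7, 6, 4, 6, 5, 8)

# Share of each quality level
shares <- prop.table(table(quality))
shares

# Combined share of levels 5 and 6
share_5_6 <- sum(shares[c("5", "6")])
share_5_6
```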


pH Level

#Histogram of pH level
plot(uni_ph)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Histogram chart of pH level - outliers removed
ggplot(aes(x = pH), data = wine)+
  geom_histogram(binwidth = 0.01)+
  scale_x_continuous(limits = c(2.8,3.6), breaks = seq(3,3.6,0.05))
## Warning: Removed 49 rows containing non-finite values (stat_bin).

#Summary chart of pH level
summary(wine$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
#Frequency table of volatile acidity (for reference with the histogram above)
table(wine$volatile.acidity)
## 
##  0.08 0.085  0.09   0.1 0.105  0.11 0.115  0.12 0.125  0.13 0.135  0.14 
##     4     1     1     6     6    13     3    34     3    44     1    56 
## 0.145  0.15 0.155  0.16 0.165  0.17 0.175  0.18 0.185  0.19   0.2 0.205 
##     4    88     5   141     2   140     1   177     5   170   214     4 
##  0.21 0.215  0.22 0.225  0.23 0.235  0.24 0.245  0.25 0.255  0.26 0.265 
##   191     1   229     4   216     4   253     4   231    10   240     5 
##  0.27 0.275  0.28 0.285  0.29 0.295   0.3 0.305  0.31 0.315  0.32 0.325 
##   218     3   263     5   160     3   198     4   148     4   182     2 
##  0.33 0.335  0.34 0.345  0.35 0.355  0.36 0.365  0.37 0.375  0.38 0.385 
##   134     7   135     9    86     1   104     2    65     2    63     2 
##  0.39 0.395   0.4 0.405  0.41 0.415  0.42 0.425  0.43 0.435  0.44 0.445 
##    61     2    59     1    54     4    36     2    35     2    46     4 
##  0.45 0.455  0.46  0.47 0.475  0.48 0.485  0.49 0.495   0.5  0.51  0.52 
##    25     2    30    15     3    17     3    14     2    14    10    10 
##  0.53  0.54 0.545  0.55 0.555  0.56  0.57  0.58 0.585  0.59 0.595   0.6 
##     8    10     1    14     2     9     4     7     2     4     2     7 
##  0.61 0.615  0.62  0.63  0.64  0.65 0.655  0.66  0.67  0.68 0.685  0.69 
##     7     4     5     2     7     2     3     4     5     3     1     2 
## 0.695 0.705  0.71  0.73  0.74  0.75  0.76  0.78 0.785 0.815  0.85 0.905 
##     3     2     1     1     1     1     2     1     1     1     1     1 
##  0.91  0.93 0.965 1.005   1.1 
##     1     1     1     1     1

There is a peak around 3.14. The pH level is probably affected by acidity. The minimum pH is 2.720 and the maximum is 3.820.


Density

#Histogram of density
plot(uni_den)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Histogram of density - outliers removed
ggplot(aes(x = density), data = wine)+
  geom_histogram(binwidth = 0.0002)+
  scale_x_continuous(limits = c(0.988,1.001), breaks = seq(0.988,1.001,0.001))+
  coord_trans(y = 'sqrt')
## Warning: Removed 28 rows containing non-finite values (stat_bin).

#Summary data of density
summary(wine$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density has a very small range, from 0.9871 to 1.0390.


Alcohol percentage by volume

#Histogram of alcohol percentage
plot(uni_alc)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Histogram of alcohol percentage - outliers removed
ggplot(aes(x = alcohol), data = wine)+
  geom_histogram(binwidth = 0.1)+
  scale_x_continuous(limits = c(8.5,13.6), breaks = seq(8.5,13.6))+
  coord_trans(y = 'sqrt')
## Warning: Removed 24 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

#Summary data of alcohol percentage
summary(wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

There is a peak around 9.4, and the distribution is skewed to the right.


Sulphates

#Histogram of sulphates
plot(uni_sul)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Histogram of sulphates - outliers removed
ggplot(aes(x = sulphates), data = wine)+
  geom_histogram(binwidth = 0.005)+
  scale_x_continuous(limits = c(0.15,0.9), breaks = seq(0.15,0.9,0.05))+
  coord_trans(y = 'sqrt')
## Warning: Removed 24 rows containing non-finite values (stat_bin).

table(wine$sulphates)
## 
## 0.22 0.23 0.25 0.26 0.27 0.28 0.29  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 
##    1    1    4    4   13   13   16   31   35   54   59   84   85  120  129 
## 0.38 0.39  0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##  214  151  168  139  181  161  216  178  225  172  179  166  249  140  156 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##  135  167  102  108   83   99   97   88   45   68   48   67   28   36   35 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   44   30   27   18   33   12   19   22   19   16   19   16    5    5   13 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.92 0.94 0.95 0.96 0.97 0.98 0.99 
##    2    4    3    2    2    7    1    5    2    2    5    3    1    6    1 
##    1 1.01 1.06 1.08 
##    1    1    1    1

The distribution is skewed to the right and appears slightly bimodal, with the sulphate concentration peaking around 0.38 and again at 0.5 (the two values with the highest counts in the table above are 0.5 and 0.46).
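
The peaks can also be read off the frequency table directly. The sketch below rebuilds only the four most common sulphate values from the counts printed in the table above; the vector name `sul` is an assumption for illustration, and with the real data `wine$sulphates` would be used instead.

```r
# Counts copied from the sulphates frequency table above
# (only the four most common values are reproduced here)
sul <- rep(c(0.38, 0.44, 0.46, 0.50), times = c(214, 216, 225, 249))

tab <- table(sul)

sort(tab, decreasing = TRUE)   # most frequent values first
top_value <- names(which.max(tab))
top_value                      # single most frequent value
```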


Q & A

What is the structure of your dataset?

The data frame consists of 4898 white wines with 12 original variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality) plus 1 derived variable (bound sulfur dioxide). The row-index column X from the CSV was dropped. Quality is stored as an integer score on the ordered scale below; the scores observed in this dataset range from 3 to 9.

Quality: (Worst) 0, 1, ..., 9, 10 (Best)

Salient observations:

  • Most white wines have a quality of 5 or 6
  • Median pH level is 3.180
  • The majority of white wines contain between 9 and 13 percent alcohol

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is volatile acidity. I wanted to find out how volatile acidity increases or decreases with the quality of the white wine. I suspect pH and some combination of the other variables can be used to build a predictive model to grade white wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I would like to see whether the amount of residual sugar increases the quality of the white wine, and whether there is any connection with the amount of alcohol in the wine.

Did you create any new variables from existing variables in the dataset?

A new variable was created named “bound.sulfur.dioxide”. It was shown in the summary of the data frame and was later used in the bi-variate plots section.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I found that the alcohol percentage distribution was more right-skewed than the other variables I investigated. Most of the white wines were below 13% alcohol. In most cases, I restricted the axis limits to exclude outliers and get a better look at the bulk of the data.
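
Right skew can be checked numerically as well as visually. The sketch below defines a simple moment-based `skewness` helper (a hypothetical function written for this example, not from a package) and applies it to made-up alcohol values; a positive result, or a mean above the median, indicates a right skew. In the full data the summary shows the same pattern for alcohol: mean 10.51 against median 10.40.

```r
# Moment-based sample skewness; positive values indicate a right skew.
# Hypothetical helper written for this illustration.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical alcohol values with a long right tail (illustration only)
alc <- c(9.0, 9.2, 9.4, 9.4, 9.5, 9.8, 10.4, 11.0, 12.5, 14.0)

skewness(alc)             # greater than 0 for right-skewed data
mean(alc) > median(alc)   # the mean sits above the median
```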

Bivariate Analysis

#Correlation matrix using pearson method
round(cor(wine, method = 'pearson'),3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.023       0.289
## volatile.acidity            -0.023            1.000      -0.149
## citric.acid                  0.289           -0.149       1.000
## residual.sugar               0.089            0.064       0.094
## chlorides                    0.023            0.071       0.114
## free.sulfur.dioxide         -0.049           -0.097       0.094
## total.sulfur.dioxide         0.091            0.089       0.121
## density                      0.265            0.027       0.150
## pH                          -0.426           -0.032      -0.164
## sulphates                   -0.017           -0.036       0.062
## alcohol                     -0.121            0.068      -0.076
## quality                     -0.114           -0.195      -0.009
## bound.sulfur.dioxide         0.136            0.157       0.102
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.089     0.023              -0.049
## volatile.acidity              0.064     0.071              -0.097
## citric.acid                   0.094     0.114               0.094
## residual.sugar                1.000     0.089               0.299
## chlorides                     0.089     1.000               0.101
## free.sulfur.dioxide           0.299     0.101               1.000
## total.sulfur.dioxide          0.401     0.199               0.616
## density                       0.839     0.257               0.294
## pH                           -0.194    -0.090              -0.001
## sulphates                    -0.027     0.017               0.059
## alcohol                      -0.451    -0.360              -0.250
## quality                      -0.098    -0.210               0.008
## bound.sulfur.dioxide          0.345     0.194               0.264
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                       0.091   0.265 -0.426    -0.017  -0.121
## volatile.acidity                    0.089   0.027 -0.032    -0.036   0.068
## citric.acid                         0.121   0.150 -0.164     0.062  -0.076
## residual.sugar                      0.401   0.839 -0.194    -0.027  -0.451
## chlorides                           0.199   0.257 -0.090     0.017  -0.360
## free.sulfur.dioxide                 0.616   0.294 -0.001     0.059  -0.250
## total.sulfur.dioxide                1.000   0.530  0.002     0.135  -0.449
## density                             0.530   1.000 -0.094     0.074  -0.780
## pH                                  0.002  -0.094  1.000     0.156   0.121
## sulphates                           0.135   0.074  0.156     1.000  -0.017
## alcohol                            -0.449  -0.780  0.121    -0.017   1.000
## quality                            -0.175  -0.307  0.099     0.054   0.436
## bound.sulfur.dioxide                0.922   0.504  0.003     0.136  -0.427
##                      quality bound.sulfur.dioxide
## fixed.acidity         -0.114                0.136
## volatile.acidity      -0.195                0.157
## citric.acid           -0.009                0.102
## residual.sugar        -0.098                0.345
## chlorides             -0.210                0.194
## free.sulfur.dioxide    0.008                0.264
## total.sulfur.dioxide  -0.175                0.922
## density               -0.307                0.504
## pH                     0.099                0.003
## sulphates              0.054                0.136
## alcohol                0.436               -0.427
## quality                1.000               -0.218
## bound.sulfur.dioxide  -0.218                1.000

From the Pearson correlation matrix above, the variables most strongly correlated with volatile acidity are bound sulfur dioxide and quality, with coefficients of 0.157 and -0.195, respectively. Both are weak in absolute terms. Let’s look at the visual representation of the correlations.
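
Picking the strongest correlations out of a large matrix by eye is error-prone, so ranking by absolute value can help. The sketch below copies the volatile.acidity row from the matrix above into a named vector `va_cor` (a name chosen for this example); with the real data the same row could be taken from the correlation matrix itself.

```r
# Correlations with volatile.acidity, copied from the matrix above
va_cor <- c(
  fixed.acidity = -0.023, citric.acid = -0.149, residual.sugar = 0.064,
  chlorides = 0.071, free.sulfur.dioxide = -0.097, total.sulfur.dioxide = 0.089,
  density = 0.027, pH = -0.032, sulphates = -0.036, alcohol = 0.068,
  quality = -0.195, bound.sulfur.dioxide = 0.157
)

# Rank by absolute value so negative correlations are not pushed to the bottom
ranked <- va_cor[order(abs(va_cor), decreasing = TRUE)]
head(ranked, 3)
```

The top three are quality, bound sulfur dioxide and citric acid, all with an absolute correlation below 0.2.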

#Correlation plot
cm <- round(cor(wine, method = 'pearson'),3)

corrplot(cm, method = "circle")

The size and color of the circles show that volatile acidity has its strongest correlations with quality, bound sulfur dioxide, and citric acid (-0.149), although all three are weak. Thus, the next step is a bivariate plot of volatile acidity against each of these three variables.

Volatile Acidity v/s Citric Acid

#Jitter Plot of citric.acid vs volatile acidity
vola <- ggplot(aes(x = citric.acid, y = volatile.acidity), data = wine)

vola + geom_jitter()

#Jitter Plot of citric.acid vs volatile acidity - outliers removed
vola + geom_jitter(alpha = 1/5)+
  scale_x_continuous(limits = c(0,0.75), breaks = seq(0,0.75,0.05))+
  geom_smooth()
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).

The amount of volatile acidity tends to decrease slightly as citric acid increases. Could the citric acid have an effect on the taste of the white wine?

Volatile Acidity v/s Quality

#Box Plot of quality vs volatile acidity
qua <- ggplot(aes(x = factor(quality), y = volatile.acidity), data = wine)

qua + geom_boxplot()+
  geom_jitter(position=position_jitter(width=.1, height=0))

#Summary of quality vs volatile acidity
by(wine$volatile.acidity, wine$quality, summary)
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

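Wines of quality 4 have the highest median (0.32) and mean (0.3812) volatile acidity of any quality level, which is consistent with high volatile acidity hurting the taste of the wine. From quality 6 upward the medians stay between 0.25 and 0.27, so the relationship is not strictly monotonic.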
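
When only one statistic per group is needed, tapply() gives a more compact view than by() with summary(). The sketch below uses a small made-up data frame `toy` (an assumption for illustration); with the real data the same call would take wine$volatile.acidity and wine$quality.

```r
# Hypothetical data (illustration only, not the wine data)
toy <- data.frame(
  quality          = c(4, 4, 4, 5, 5, 5, 6, 6, 6),
  volatile.acidity = c(0.27, 0.32, 0.46, 0.24, 0.28, 0.34, 0.20, 0.25, 0.30)
)

# Median volatile acidity per quality level
med <- tapply(toy$volatile.acidity, toy$quality, median)
med

# Quality level with the highest median
worst_level <- names(which.max(med))
worst_level
```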

Volatile Acidity v/s Bound Sulfur Dioxide

#Jitter Plot of bound.sulfur.dioxide vs volatile acidity
bound <- ggplot(aes(x = bound.sulfur.dioxide, y = volatile.acidity), data = wine)

bound + geom_jitter()

#Jitter Plot of bound.sulfur.dioxide vs volatile acidity - outliers removed
bound + geom_jitter(alpha = 1/3)+
  scale_y_continuous(limits = c(0.10,0.9))+
  scale_x_continuous(limits = c(25,250), breaks = seq(25,250,10))+
  geom_smooth()
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 38 rows containing missing values (geom_point).

The amount of volatile acidity tends to increase as bound sulfur dioxide increases, although the trend is weak.

Let’s also look into alcohol against quality.

#Jitter Plot of alcohol vs quality
alc_qua <- ggplot(aes(x = quality, y = alcohol), data = wine)
alc_qua + geom_jitter()

#Jitter Plot of alcohol vs quality - with transparency and a smoother
alc_qua <- ggplot(aes(x = jitter(quality), y = alcohol), data = wine)
alc_qua + geom_jitter(alpha = 1/3)+
  scale_x_continuous(breaks = seq(0,10,1))+
  geom_smooth()

Interestingly, we observe a trend: as the alcohol percentage increases, so does the quality.

#Jitter Plot of free.sulfur.dioxide vs bound.sulfur.dioxide
fb <- ggplot(aes(x = free.sulfur.dioxide, y = bound.sulfur.dioxide), data = wine)
fb + geom_jitter(alpha = 1/3)+
  scale_x_continuous(limits = c(0,80), breaks = seq(0,80,10))+
  geom_smooth()
## Warning: Removed 50 rows containing non-finite values (stat_smooth).
## Warning: Removed 50 rows containing missing values (geom_point).

Plot of bound vs free sulfur dioxide, showing a positive correlation (0.264).

Q & A

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Volatile acidity correlates most strongly with quality (-0.195), bound sulfur dioxide (0.157) and citric acid (-0.149), but all of these correlations are weak.

The amount of volatile acidity tends to decrease as citric acid increases, but the data are widely spread and show only small clusters.

Overlaying the jittered points on the box plot of volatile acidity against quality creates a good visual for comparing the different quality levels.

The plot of volatile acidity against bound sulfur dioxide did not show a clear relationship because the data are widely spread, but it did show an increase in volatile acidity at high levels of bound sulfur dioxide.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

With the new variable that I created, there is a positive correlation between free sulfur dioxide and bound sulfur dioxide (0.264). Also, alcohol against quality showed that quality tends to increase with the alcohol percentage (0.436).

What was the strongest relationship you found?

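For volatile acidity, the strongest relationship was its negative correlation with quality (-0.195): higher-quality white wines tend to have lower volatile acidity. Across the whole matrix, the strongest pairs were bound and total sulfur dioxide (0.922, expected because one is derived from the other) and density and residual sugar (0.839).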

Multivariate Analysis

Citric Acid v/s Volatile Acidity factored by Quality

#Scatter plot of citric.acid vs volatile acidity colored by quality
gpfq <- geom_point(aes(color = factor(quality)))
vola + gpfq + scale_color_brewer(palette = "Reds")+
  theme_dark()

The volatile acidity plot elaborates on the odd trends seen in the box plots earlier. Most wines of quality 6 and above do not exceed 0.75 of volatile acidity.

Bound Sulfur Dioxide vs Volatile Acidity factored by Quality

#Scatter plot of bound.sulfur.dioxide vs volatile acidity colored by quality
bound + gpfq+
  scale_y_continuous(limits = c(0.10,0.9))+
  scale_x_continuous(limits = c(25,250), breaks = seq(25,250,20))+
  scale_color_brewer(palette = "Greens")+
  theme_dark()
## Warning: Removed 35 rows containing missing values (geom_point).

Most of the quality levels are widely spread, but there does seem to be a large grouping between 45 and 170 mg/dm3 of bound sulfur dioxide.

Q & A

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The plot of citric acid against volatile acidity showed a clearer pattern once quality was added as color, even though the underlying correlation is negative and weak.

Were there any interesting or surprising interactions between features?

Surprisingly, higher-quality wines tend to have lower bound sulfur dioxide, which can be seen in the difference in shades of green in the plot.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No.

Final Plots and Summary

Plot One

ggplot(aes(x = volatile.acidity), data = wine)+
  geom_histogram(binwidth = 0.01)+
  scale_x_continuous(limits = c(0.1,0.7), breaks = seq(0.1,0.7,0.1))+
  labs(title = "Volatile Acidity in White Wine", x = "Volatile Acidity (g/dm3)", y = "Count of White Wines")
## Warning: Removed 24 rows containing non-finite values (stat_bin).

Description One

The distribution of volatile acidity appears to be unimodal. There is a curious spike around 0.28.

Plot Two

alc_qua + geom_jitter(alpha = 1/3)+
  scale_x_continuous(breaks = seq(0,10,1))+
  geom_smooth()+
  labs(title = "Alcohol Content by Quality in White Wine", x = "Quality (0 to 10)", y = "Alcohol (%)")

Description Two

Alcohol content tends to rise with the quality level of the white wines. With a coefficient of 0.436, alcohol has the strongest correlation with quality of any variable in the dataset.

Plot Three

vola + gpfq +
  scale_color_brewer(palette = "Blues")+
  theme_dark()+
  labs(title = "Volatile Acidity vs Citric Acid by Quality in White Wine", x = "Citric Acid (g/dm3)", y = "Volatile Acidity (g/dm3)", colour = "Quality of Wine")

Description Three

The quality of wine increases as we move towards the lower right of the plot. Wine seems to have better quality when citric acid is around 0.15 and volatile acidity is 0.3.


Reflection

This dataset contains information on 4,898 different white wines from a 2009 study. My goal was to find which chemical properties affected the volatile acidity in the white wine. I started out by exploring the distribution of individual variables and looked for unusual behaviors in the histograms. I then calculated and plotted the correlations between volatile acidity and the other variables. None of these correlations were above 0.5 in absolute value. The two chemical variables with relatively strong correlations were citric acid and bound sulfur dioxide, but the individual correlations were not strong enough to draw definitive conclusions with bivariate analysis methods alone. However, the multivariate plot shown as Final Plot 3 showed the increase in quality at certain citric acid values.

One suggestion for this dataset is to include storage time and storage method, since these factors can influence the quality of wine as well. Further studies might include the relationship between price and quality to investigate whether more expensive wines are of better quality.